This report explains the process I followed in the dengue competition. The goal is to predict the weekly number of dengue cases in two locations (San Juan and Iquitos) from environmental variables describing changes in temperature, precipitation, vegetation, and more.

You can read more about this competition here.

A. FIRST CLEANING

A.1. MISSING VALUES

Let’s explore where our missing values are located. For now, we replace them with the na.locf function (last observation carried forward), although a better imputation method will be needed later.

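library(dplyr)    # %>%, filter, select_if, mutate, group_by, summarise, arrange
library(tidyr)    # gather
library(ggplot2)  # ggplot

# Count the missing values in each numeric variable (San Juan, years after 2000)
missing_values <- data_sj %>%
  filter(year > 2000) %>%
  select_if(is.numeric) %>%
  gather(key = "key", value = "val") %>%
  mutate(is.missing = is.na(val)) %>%
  group_by(key, is.missing) %>%
  summarise(num.missing = n()) %>%
  filter(is.missing) %>%
  select(-is.missing) %>%
  arrange(desc(num.missing))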

ggplot(missing_values, aes(x=key, y=num.missing, fill=key)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle=60, hjust=1)) 

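# Fill missing values by carrying the last observation forward
# (na.locf is assumed here to come from the zoo package)
library(zoo)
data_sj <- na.locf(data_sj)
data_iq <- na.locf(data_iq)
rm(missing_values)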
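
To make the effect of this placeholder imputation concrete, here is a minimal base R sketch of last observation carried forward on a toy vector, next to linear interpolation as one possible alternative. The helper name locf_base and the toy values are illustrative assumptions, not part of the project code, and interpolation is only one candidate for the better method mentioned above.

```r
# Toy weekly series with gaps (illustrative values, not competition data)
x <- c(NA, 1, NA, NA, 4, NA)

# Hypothetical helper: last observation carried forward in base R.
# Leading NAs stay NA because there is nothing before them to carry.
locf_base <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the latest observed value so far
  idx[idx == 0] <- NA        # no observation seen yet
  x[!is.na(x)][idx]
}

filled_locf <- locf_base(x)

# Alternative: linear interpolation between neighbouring observations.
# approx() is given only the observed points; positions outside their
# range stay NA with the default rule.
obs <- !is.na(x)
filled_interp <- approx(which(obs), x[obs], xout = seq_along(x))$y
```

Carrying the last value forward repeats it across the whole gap, while interpolation draws a straight line across it. Which behaviour suits weekly climate readings better is a modelling choice.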

B. EXPLORATORY ANALYSIS

After renaming some variables and converting others to the right type (factor, numeric, datetime), we split the dataset by city and explore each one separately.

B.1. TEMPERATURE

Distributions

Let’s compare the temperature distributions in Iquitos

#### 2.2. TEMPERATURE IQ & SJ _____________________________________________ ####

# Get the temperature variables
temp_var<-names(data[grep("temp", names(data))])
temp_var_k<-c("temp_air_mean_r","temp_air_avg_r","temp_dewpoint_r","temp_max_r",
            "temp_min_r")

# Change everything to celsius
data_iq[temp_var_k]<- kelvin.to.celsius(data_iq[temp_var_k], round = 2)
data_sj[temp_var_k]<- kelvin.to.celsius(data_sj[temp_var_k], round = 2)

rm(temp_var_k)

# Temperature variables train & validation 
geom.density.function(data=data_iq, variables=c(temp_var[1:10], "source"), 
                      fill="source")
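
The unit change above amounts to subtracting 273.15 from each Kelvin value. A minimal base R sketch of that arithmetic on a toy data frame follows; the helper name kelvin_to_celsius_base and the sample values are assumptions for illustration, while the report itself calls kelvin.to.celsius.

```r
# Hypothetical helper: Kelvin to Celsius, rounded to a fixed number of digits
kelvin_to_celsius_base <- function(k, digits = 2) {
  round(k - 273.15, digits)
}

# Toy data frame with two Kelvin columns and one column that is already
# in Celsius (illustrative values only)
toy <- data.frame(
  temp_min_r  = c(295.9, 297.3),
  temp_max_r  = c(299.8, 301.4),
  temp_avg_st = c(26.4, 27.1)
)

# Convert only the columns known to be in Kelvin
kelvin_cols <- c("temp_min_r", "temp_max_r")
toy[kelvin_cols] <- lapply(toy[kelvin_cols], kelvin_to_celsius_base)
```

Listing the Kelvin columns by name, as the report does with temp_var_k, keeps the columns that are already in Celsius from being converted twice.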

Let’s compare the temperature distributions in San Juan

geom.density.function(data=data_sj, variables=c(temp_var[1:10], "source"), 
                      fill="source")
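
geom.density.function is a custom helper whose definition is not shown in this report. As a rough idea of the comparison it draws, the sketch below estimates one density curve per value of source for a single variable in base R. The function name density_by_group, the labels "train" and "validation", and the toy values are all assumptions for illustration.

```r
# Hypothetical helper: one kernel density estimate per group
density_by_group <- function(df, variable, group = "source") {
  # split() separates the values by group; density() is applied to each part
  lapply(split(df[[variable]], df[[group]]), density, na.rm = TRUE)
}

# Toy data (illustrative values only)
toy <- data.frame(
  temp_min_r = c(21.5, 22.1, 22.8, 23.0, 21.9, 22.4, 23.3, 22.6),
  source     = rep(c("train", "validation"), each = 4)
)

dens <- density_by_group(toy, "temp_min_r")
```

Each element of dens can be passed to plot() or lines() to overlay the curves. If the two curves differ strongly, the variable is distributed differently in the two sources, which is worth knowing before modelling.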

Relations

Let’s plot all temperature variables for Iquitos together with the number of cases

# Plot the progression of the temperature variables
plotly.line.function(data_iq,variables= c(temp_var[1:10], "total_cases", 
                                          "week_start_date"), x="week_start_date")

Let’s plot all temperature variables for San Juan together with the number of cases (the dependent variable is divided by 10 for visualization purposes only)

datatemp_sj <- data_sj
datatemp_sj$total_cases<-datatemp_sj$total_cases/10

plotly.line.function(datatemp_sj,variables= c(temp_var[1:10], "total_cases", 
                                          "week_start_date"), x="week_start_date")
rm(temp_var, datatemp_sj)

B.2. HUMIDITY & PRECIP

Distributions

Let’s compare the humidity and precipitation distributions in Iquitos

# Get the humidity and precipitation variables
humid_precip_var<-names(data[grep("humid|precip", names(data))])

# Humidity and precipitation variables, train & validation
geom.density.function(data=data_iq, variables=c(humid_precip_var, "source"), 
                      fill="source")

Let’s compare the humidity and precipitation distributions in San Juan

geom.density.function(data=data_sj, variables=c(humid_precip_var, "source"), 
                      fill="source")

Relations

Let’s plot all humidity and precipitation variables for Iquitos together with the number of cases

# Plot the progression of the humid & precip variables
plotly.line.function(data_iq,variables= c(humid_precip_var, "total_cases", 
                                          "week_start_date"), x="week_start_date")

Let’s plot all humidity and precipitation variables for San Juan together with the number of cases

# Plot the progression of the humid & precip variables
plotly.line.function(data_sj,variables= c(humid_precip_var, "total_cases", 
                                          "week_start_date"), x="week_start_date")
rm(humid_precip_var)

B.3. VEGETATION

Distributions

Let’s compare the vegetation distributions in Iquitos

# Get the vegetation variables
var_veg<-names(data[grep("ndvi", names(data))])

# Vegetation variables, train & validation
geom.density.function(data=data_iq, variables=c(var_veg, "source"), 
                      fill="source")

Let’s compare the vegetation distributions in San Juan

geom.density.function(data=data_sj, variables=c(var_veg, "source"), 
                      fill="source")

Relations

Let’s plot all vegetation variables for Iquitos together with the number of cases (the vegetation variables are multiplied by 100 for visualization purposes only)

# Plot the progression of the veg variables
dataveg_iq <- data_iq
dataveg_iq[var_veg]<-dataveg_iq[var_veg]*100
plotly.line.function(dataveg_iq,variables= c(var_veg, "total_cases", 
                                          "week_start_date"), x="week_start_date")

Let’s plot all vegetation variables for San Juan together with the number of cases (the vegetation variables are multiplied by 1000 for visualization purposes only)

dataveg_sj <- data_sj
dataveg_sj[var_veg]<-dataveg_sj[var_veg]*1000
plotly.line.function(dataveg_sj,variables= c(var_veg, "total_cases", 
                                          "week_start_date"), x="week_start_date")
rm(var_veg,dataveg_iq, dataveg_sj)
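
The scaling factors used in these plots (10, 100 and 1000) were picked by hand for each chart. As an alternative, not what the report does, a series can be rescaled onto the range of another one so that both share an axis without a hand-picked constant. The helper name rescale_to and the toy values below are assumptions for illustration.

```r
# Hypothetical helper: linearly map x onto the range of a target series
rescale_to <- function(x, target) {
  rng_x <- range(x, na.rm = TRUE)
  rng_t <- range(target, na.rm = TRUE)
  stopifnot(diff(rng_x) > 0)  # a constant series cannot be rescaled this way
  (x - rng_x[1]) / diff(rng_x) * diff(rng_t) + rng_t[1]
}

# Toy series (illustrative values only)
cases <- c(0, 20, 60, 100)
ndvi  <- c(0.10, 0.25, 0.30, 0.20)

# ndvi now spans the same range as cases, for plotting only
ndvi_scaled <- rescale_to(ndvi, cases)
```

As with the fixed factors, the rescaled values should only be used for plotting. The original columns stay untouched for modelling.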

C. REDUCING DIMENSIONALITY

C.1. REMOVE COLLINEARITY

Let’s remove redundant predictors, that is, those highly correlated with other predictors (using a correlation threshold of 0.80). The table below shows which variables were kept ("Yes") and which were removed for each city.

final_variables<-data.frame(Iquitos = names(data_full), San_Juan = names(data_full))
rownames(final_variables)<- names(data_full)

final_variables$Iquitos<-ifelse(final_variables$Iquitos %in% names(data_iq), 
                                "Yes", "Removed")
final_variables$San_Juan<-ifelse(final_variables$San_Juan %in% names(data_sj), 
                                "Yes", "Removed")

# Show kept variables ("Yes") in green and removed ones in red
formattable(final_variables, list(
  Iquitos = formatter("span", style = x ~ ifelse(x == "Yes", 
    style(color = "green", font.weight = "bold"), style(color="red",font.weight = "bold"))),
  San_Juan = formatter("span", style = x ~ ifelse(x == "Yes", 
    style(color = "green", font.weight = "bold"), style(color="red",font.weight = "bold")))
  
))
Variable            Iquitos   San_Juan
city                Yes       Yes
year                Yes       Yes
weekofyear          Yes       Yes
week_start_date     Yes       Yes
ndvi_ne             Yes       Yes
ndvi_nw             Yes       Yes
ndvi_se             Yes       Yes
ndvi_sw             Removed   Removed
precip_amt          Removed   Removed
temp_air_mean_r     Yes       Removed
temp_air_avg_r      Removed   Removed
temp_dewpoint_r     Removed   Removed
temp_max_r          Yes       Removed
temp_min_r          Yes       Removed
precip_kgperm2_r    Yes       Yes
humid_relative_r    Yes       Yes
precip_mm_r         Yes       Yes
humid_specific_r    Yes       Yes
temp_dir_range_r    Yes       Yes
temp_avg_st         Yes       Removed
temp_dir_range_st   Yes       Yes
temp_max_st         Yes       Yes
temp_min_st         Yes       Removed
precip_st           Yes       Yes
total_cases         Yes       Yes
month               Yes       Yes
source              Yes       Yes
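
The filtering step itself is not shown in this report, so the following is only a sketch of how a correlation filter with a 0.80 threshold could work, written in base R. It repeatedly finds the most correlated pair of predictors and drops the one with the larger mean absolute correlation to the rest, which is similar in spirit to findCorrelation from the caret package. The function name drop_correlated, the toy column names, and the values are assumptions, and the author's actual method may differ.

```r
# Hypothetical helper: names of predictors to remove so that no remaining
# pair has an absolute correlation above the threshold.
# Assumes all columns are numeric and none is constant.
drop_correlated <- function(df, threshold = 0.80) {
  cm <- abs(cor(df, use = "pairwise.complete.obs"))
  diag(cm) <- 0                      # ignore self-correlation
  removed <- character(0)
  while (ncol(cm) > 1 && max(cm) > threshold) {
    # Most correlated pair still present
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    # Of the two, drop the one more correlated with everything else
    means <- rowMeans(cm[pair, , drop = FALSE])
    drop <- rownames(cm)[pair[which.max(means)]]
    removed <- c(removed, drop)
    keep <- setdiff(rownames(cm), drop)
    cm <- cm[keep, keep, drop = FALSE]
  }
  removed
}

# Toy predictors: two nearly identical temperature columns and an
# unrelated precipitation column (illustrative values only)
toy <- data.frame(
  temp_avg = c(25.1, 26.3, 27.0, 26.1, 24.8, 27.5),
  temp_max = c(30.0, 31.4, 32.1, 31.0, 29.9, 32.4),
  precip   = c(12, 3, 40, 0, 25, 8)
)

removed <- drop_correlated(toy, threshold = 0.80)
kept <- setdiff(names(toy), removed)
```

On this toy data one of the two temperature columns is removed and precip is kept. Running the filter separately on each city's data would explain why the table lists different removed variables for Iquitos and San Juan.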

G. COMPLETE CODE

If you want to check the complete code of this project, you can visit the repository on GitHub